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Abstract — In this work a new way to calculate the multivari- 
ate joint entropy is presented. This measure is the basis for a 
fast information-theoretic based evaluation of gene relevance in 
a Microarray Gene Expression data context. Its low complexity 
is based on the reuse of previous computations to calculate 
current feature relevance. 

The /i-TAFS algorithm -named as such to differentiate 
it from previous TAFS algorithms- implements a simulated 
annealing technique specially designed for feature subset 
selection. The algorithm is applied to the maximization of 
gene subset relevance in several public-domain microarray 
data sets. The experimental results show a notoriously high 
classiflcation performance and low size subsets formed by 
biologically meaningful genes. 

Keywords -Feature Selection; Microarray Gene Expression 
Data; Multivariate Joint Entropy; Simulated Annealing. 

I. Introduction 

In cancer diagnosis, classification of the different tumor 
types is of great importance. An accurate prediction of 
different tumor types provides better treatment and toxicity 
minimization on patients. Traditional methods of tackling 
this situation are primarily based on morphological charac- 
teristics of tumorous tissue [1 J. These conventional methods 
are reported to have several diagnosis limitations. In order 
to analyze the problem of cancer classification using gene 
expression data, more systematic approaches have been 
developed |2|. 

Pioneering work in cancer classification by gene ex- 
pression using DNA microarray showed the possibility to 
help the diagnosis by means of Machine Learning or more 
generally Data Mining methods IS), which are now ex- 
tensively used for this task Q. However, in this setting 
gene expression data analysis entails a heavy computational 
consumption of resources, due to the extreme sparseness 
compared to standard data sets in classification tasks [5 |. 

Typically, a gene expression data set may consist of 
dozens of observations but with thousands or even tens 
of thousands of genes. Classifying cancer types using this 



very high ratio between number of variables and number of 
observations is a delicate process. As a result, dimensionality 
reduction and in particular feature subset selection (FSS) 
techniques may be very useful. Finding small subsets of 
very relevant genes among a huge quantity could derive in 
much specific and efficient treatments. 

This work addresses the problem of selecting a subset of 
features by using the TAFS (Thermodynamic Algorithms for 
Feature Selection) family of methods for the FSS problem. 
Given a suitable objective function, the algorithm makes uses 
of an special-purpose simulated annealing (SA) technique to 
find a good subset of features that maximizes the objective 
function. A distinctive characteristic of TAFS over other 
search algorithms for FSS is its probabilistic capability 
to accept momentarily worse solutions, which in the end 
may result in better hypotheses. Despite their powerful 
optimization capability, SA-based search algorithms usually 
lack execution speed, involving long convergence times. In 
consequence, they have been generally excluded as an option 
in FSS problems, let alone in highly complex domains such 
as microarray gene expression data. A few contributions us- 
ing the classical SA algorithm for FSS are found in prostate 
protein mass spectrometry data (|6], marketing apphcations 
fT\, or parameter optimization in clustering gene expression 
analysis |8|. 

Our answer to these computational problems is twofold. 
First, we use a filter objective function for FSS (thus 
avoiding the development of a predictive model for every 
subset evaluation). Second, the objective function itself 
is evaluated very efficiently based in the reutilization of 
previous computations. 

Specifically, a new way to calculate the multivariate joint 
entropy for categorical variables is presented that is both 
exact and very efficient. This measure is then used by 
a SA-based TAFS algorithm to search for small subsets 
of highly relevant genes in five public domain microarray 
data sets. Classification experiments yield some of the best 



results reported so far for these data sets and offer a drastic 
reduction in subset sizes. 

The paper is organized as follows: section II briefly 
reviews the Simulated Annealing technique; section III 
reviews and motivates the previous Thermodynamic Algo- 
rithms for feature subset selection; section IV develops the 
information- theoretic measure for feature relevance and its 
efficient implementation; section V describes the data sets 
and the experimental settings; section VI presents the results 
and their interpretation. The paper ends with the conclusions 
and directions for future work. 

II. Simulated Annealing 

Simulated Annealing (SA) is a stochastic technique in- 
spired on statistical mechanics for finding (near) globally 
optimal solutions to large optimization problems. SA is a 
weak method in that it needs almost no information about 
the structure of the search space. The algorithm works by 
assuming that some parts of the current solution belong to a 
potentially better one, and thus these parts should be retained 
by exploring neighbors of the current solution. Assuming 
the objective function is to be minimized, then SA would 
jump from hill to hill and hence escape or simply avoid 
sub-optimal solutions. 

When a system 5* (considered as a set of possible states) 
is in thermal equilibrium (at a given temperature T), the 
probability that it is in a certain state s, called Pt{s), 
depends on T and on the energy E{s) of the state s. This 
probability follows a Boltzmann distribution: 



Pt{s) = 



exp 



E{s) 
kT 



z 



, with Z = cxp 

ses 



kT 



where k is the Boltzmann constant and Z acts as a normal- 
ization factor Metropolis and his co-workers developed a 
stochastic relaxation method that works by simulating the 
behavior of a system at a given temperature T jOj. Being s 
the current state and s' a neighboring state, the probability 
of making a transition from s to s' is the ratio Pt{s — > s') 
between the probability of being in s and the probability of 
being in s': 



Pt{s -¥ s') = 



Pt{s') 



exp 



AE \ 



(1) 



Pt(s) V kT J 

where we have defined AE = E{s') — E{s). Therefore, the 
acceptance or rejection of s' as the new state depends on the 
difference of the energies of both states at temperature T. 
If -Pt(s') > Pt{s) then the "move" is always accepted. 
It Pt{s') < Pt{s) then it is accepted with probability 
Pt{s, s') < 1 (this situation corresponds to a transition to a 
higher-energy state). 

Note that this probability depends upon the current tem- 
perature T and decreases as T does. In the end, there will 



be a value of T low enough (the freezing point), wherein 
these transitions will be very unlikely and the system will 
be considered frozen. In order to maximize the probability 
of finding states of minimal energy at every value of T, 
thermal equilibrium must be reached. To do this, according 
to Metropolis, an annealing schedule is designed to prevent 
the process from getting stuck at a local minimum. The SA 
algorithm introduced in [TOl consists in using the Metropolis 
idea at each temperature T for a finite amount of time. In 
this algorithm T is first set at a initially high value, spending 
enough time at it so to approximate thermal equilibrium. 
Then a small decrement of T is performed and the process 
is iterated until the system is considered frozen. 

If the cooling schedule is well designed, the final reached 
state may be considered a near-optimal solution. However, 
the whole process is inherently slow, mainly because of the 
thermal equilibrium requirement at every temperature T. 

III. Thermodynamic Algorithms for FSS 

In this section we review TAFS (Thermodynamic Algo- 
rithm for Feature Selection) and eTAFS, two algorithms for 
FSS that were originally designed for problems of moderate 
feature size (up to one hundred) ifTTl . If we consider FSS 
as a search of possible feature subsets of the full feature set 
X, then SA acts as a combinatorial optimization process 
ifTH . In diis sense, TAFS and eTAFS find a subset of 
features that optimize the value of a given objective function 
J : V{X) — > K. From now on, we assume that this function 
is to be maximized and that J > cQ. 

To this end, a special-purpose forward/backward mecha- 
nism is embedded into an SA algorithm, taking advantage 
of its most distinctive characteristic, the probabilistic accep- 
tance of worse scenarios over a finite time. This charac- 
teristic is enhanced by the notion of an e-improvement: a 
feature e-improves a current solution if it has a higher value 
of the objective function or a value not worse than e%. This 
mechanism is intended to account for noise in the evaluation 
of the objective function (caused either by the finiteness of 
the data set or introduced by the chosen resampling method). 

The pseudo-code of TAFS is depicted in Algorithm [T] 
The algorithm consists of two major loops. The outer loop 
waits for the inner loop to finish and then updates T 
according to the chosen cooling schedule. When this loop 
reaches Tmin, the algorithm halts. It keeps track of the best 
solution found (which is not necessarily the current one). 

The inner loop is the core of the algorithm and is 
composed of two interleaved procedures: Forward and Back- 
ward, that iterate until an equilibrium point is found. These 
procedures work independently of each other, but share 
information about the results of their respective search in 
the form of the current solution. Within them, FSS takes 



'This is the case for accuracy, mutual information, distances and many 
other useful measures. 



Algorithm 1: TAFS algorithm for feature selection 

input : X : Full Feature set{X\...Xn} 
J() : Objective Function 
a{) : Cooling Schedule 
€ : Epsilon 

To ; Initial Temperature 
Tmin '■ Final Temperature 

1 Xcur <— Initial current subset 

2 Jcur Initial objective function value 

3 T <— To Initial temperature 

4 whUe T > Tmin do 

5 repeat 

6 Y < — Xcur 

7 Forward (Xcur, Jcur) 

s Backward (Xcur, Jcur) 

9 until Y = Xcur 

10 T ^ a(T) 



Algorithm 2: Procedure Forward {Z, Jz are modified) 



Algorithm 4: Function >^ 



input : 
repeat 



Z,Jz 



argmax J(Z U {^i}) 

if >e (Z, x, true) then 

accept <— true 
else 

AJ^ J(Zu{x})- J(Z) 

A J 

rand(0, 1) < e r 



accept *r- 

if accept then 

|_ Z ^ Z U {x} 

if J{Z) > Jcur then 
|_ Jz^J(Z) 

12 until not accept 



Algorithm 3: Procedure Backward (Z, Jz are modified) 



input : 
repeat 



Z,Jz 



argmax J(Z \ {Xi}) 

if >e (Z,x, false) then 

I accept i— true 
else 

AJ^ J(Z\{x}) - J(Z) 
accept •<— rana!(0, 1) < e"? 

if accept then 
L Z^Z\{x} 

if J(Z) > Jz then 
|_ Jz^J(Z) 

12 until not accept 



place and the mechanism to escape from local minima starts 
working. These procedures iteratively add or remove features 
one at a time in such a way that an e-improvement is 
accepted unconditionally, whereas a non e-improvement is 
accepted probabilistically. The pseudo-code for Forward and 
Backward, and e-improvement is outlined in Algorithms |2l 
|3]and|4] When Forward and Backward finish their respective 
tasks, TAFS checks if the current solution is the same as it 



input : Z,x,d 
output: boolean 

1 if d then 

2 I Z' ^ ZU{x} 

3 else 

4 \_ Z' ^ Z\{x} 

5 Ax ^ J(Z') - J{Z) 

6 if Aa; > then 

7 I return true 

8 else 

9 1^ ri 



•eturn -jj^ < e 



was prior to their execution. If this is the case, then we 
consider that thermal equilibrium has been reached and T 
is adjusted, according to the cooling schedule. If it is not, 
another loop of Forward and Backward is carried out. 

A. eTAFS: an Enhanced TAFS Algorithm 

A modification to Algorithm [1] aimed at speeding up 
relaxation time is presented in this section. The algorithm 
-named eTAFS, see Algorithms |5] and |6]- is endowed with 
a feature search window (of size I) in the backward step, as 
follows. In forward steps always the best feature is added 
(by looking all possible additions). In backward steps this 
search is limited to I tries at random (without replacement). 
The value of I is incremented by one at every thermal- 
equilibrium point. This mechanism is an additional source 
of non-determinism and a bias towards adding a feature 
only when it is the best option available. On the contrary, 
to remove one, it suffices that its removal e-improves the 
current solution. Another direct consequence is of course 
a considerable speed-up of the algorithm. Note that the 
design of eTAFS is such that it grows more and more 
deterministic, informed and costly as it converges towards 
the final configuration. 

Algorithm 5: eTAFS algorithm for feature selection 

input : X : Full Feature set {Xi . . . Jf„} 

J() : Objective Function 
a{) : Cooling Schedule 
e : Epsilon 

To Initial Temperature 
Tmin Final Temperature 

1 Xcur <— Initial current subset 

2 Jcur •<— Initial objective function value 

3 T To Initial temperature 

4 / 2 Window size (for backward steps) 

5 while T > Tmin do 

6 repeat 

7 Y ^ Xcur 
Forward {Xcur, J cur, I) 

9 Backward (Xcur, J cur, I) 

10 until Y = Xcur 
T <- a{T) 
/ <- / + 1 



Algorithm 6: eTAFS Backward procedure (Z, Jz are 
modified). Note that Xq can be efficiently computed 
while in the for loop). 

input : Z, Jz,l 

1 A <- 0;Afl ^ 

2 repeat 

3 for i := 1 to min{l, \Z\) do 

4 Select X & Z \ Ag randomly 

5 if >e (Z, X, false) then 
|_ Au{x} 

Ab ^ AbU {x} 

Xq ^ argmax{ J(Z \ {X})} 

X&Ab 

9 if Xo G A then 

10 accept true 

else 

12 AJ ^ J{Z\{Xo}) - J{Z) 

AJ 

13 I accept •<— rana!(0, 1) < e t 

if accept then 
|_ Z^Z\{Xo} 

if J(Z) > Jz then 
|_ Jz^J(Z) 

18 until not accept 



IV. Information Theoretic Feature Relevance 

A. Entropy definitions 

Entropy, a main concept in information theory lfT3l . can 
be seen as an average of uncertainty in a random variable. 
If X is a discrete random variable with probability mass 
function p, its entropy is defined b}|5 

H(X) ^-Y^p{x) \ogp{x) = -Ex[\ogp(X)] (2) 

X 

being E[] the expectation operator. If a variable {X) is 
known and another one (Y) is not, the conditional entropy 
of Y with respect to X: is the mutual entropy with respect 
to the corresponding conditional distribution: 

i/(y|X) = -^^p(a^,2/)logp(y|x). (3) 



/(Xi,...,X„;r) =^/(X,;y|Xi,...,X,_i) 



i=l 



H{Y)-H{Y\Xi,...,X^). 



(5) 



Conditional MI is expressed in the natural way, by con- 
ditioning in (|4|i: 



I{X; Y\Z) = H{Y\Z) - H{Y\X, Z) 



(6) 



The MI has been used with success as for feature se- 
lection in machine learning tasks. Currently there is no 
agreed-upon definition of the general multivariate mutual 
information I{Xi] . . . ; Xn)- An existent proposal is the 
interaction information, described e.g. in |14| which, for the 
case of three variables X, Y, Z, is defined as I{X;Y; Z) = 
I{X; Y\Z)—I{X; Y). The extension to the multivariate case 
is in terms of the marginal entropies and is given by: 



; X„ 



- E (- 

rC{Xi,...,X„} 



l-li/(r). 



This definition is impractical due to its exponential char- 
acter. In the next section, the objective function J takes the 
form of an information-theoretic index of relevance based 
on the multivariate joint entropy, which has already been 
used elsewhere [15 |. One of the contributions of this paper 
resides in a fast implementation of the calculation and its 
application to microarray gene expression data. 

B. Incremental Multivariate Joint Entropy 

For a random variable X, it is known that the joint entropy 
obeys the following property: 



H{X,Y) > H{X) 



(7) 



This property says that joint entropy is always at least 
equal to the entropies of the original system: adding a new 
variable can never reduce the available uncertainty. If we 
rewrite d?) as an equation: 



From these two definitions another concept is build, the 
mutual information (MI), which can be interpreted as a 
measure of the information that a random variable has or 
explains about another one. 



I{X- Y) = H(Y) - H(Y\X) = Ex,Y [log 



p{x)p{y) 



(4) 



The computation of the MI can be extended from the 
bivariate to the multivariate case of a number n > 2 of 
variables, against another one: 

^AU log are to base 2. 



H{X,Y) = H{X) + Ax{Y), 



(8) 



then Ax{Y) > represents the increment in entropy due 
to the addition of the variable Y to the system. In a feature 
selection setting, given Z a class variable, t C X the current 
subset and H{t) its joint entropy, if a new feature Xi e X\t 
is considered for possible inclusion in the current subset: 



H{Z, T U {X,}) = H{Z, r) + AzAX^) 



(9) 



It turns out that, to obtain the next calculation, it is 
computationally far more advantageous to store H{Z, r) and 
calculate the quantity Az,T{Xi) than to compute the fuU 







-P(Xi) logP(Xi) 





0.538 


0.481 
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0.462 


0.515 




H(X^) = 


0.996 



Xi 
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P{Xi,X2) 


-P{Xi,X2) logP(Xi,X2) 








0.231 


0.488 
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0.308 


0.523 






H(Xi,X2) = 


1.011 


1 





0.154 


0.415 


1 


1 


0.308 


0.523 






H(XuX2) = 


0.939 



Table 1 

Marginal Entropy Scheme (MES) TABLES FOR ONE VARIABLE (LEFT) AND THE ADDITION OF A SECOND VARIABLE (RIGHT). P( ) IS THE 
PROBABILITY MASS FUNCTION, OBTAINED FROM THE DATA (ALL ENTROPIES ARE IN BITS). 



joint entropy H{Z^TiJ{Xi}) directly. In order to obtain this 
value, and incremental procedure to calculate multivariate 
joint entropy has been developed, as described in the sequel. 

The incremental multivariate joint entropy (|9) must be 
computed at every evaluation step involving a possible 
candidate feature Xi to be included in the current subset 
T. Throughout the process, r is associated with its current 
Marginal Entropy Scheme (MES), a table storing the unique 
values contained in the data set for its forming features and 
its corresponding entropy value. An example of a MES table 
for two binary variables {Xi, X2] is shown in Table U 

At the initial step (r = 0) the MES table for the addition 
of Xi to is indicated in the left part of Table |T] The two 
unique values and their entropies H{Xi = 0) = 0.481 
and H{Xi = 1) = 0.515 are calculated. Let us suppose 
that a feature X2 is to be evaluated w.rt the current subset 
T = {^i}- The MES table with its unique forming patterns 
are indicated in the right part of Table U We can see that 
by introducing X2 to the current subset r, four partitions 
are generated for each unique value of Xi: {00, 01, 10, 11}. 
In the particular case of Xi — 0, a change in its entropy 
contribution is produced by the action of X2 by splitting 
it into two entropy values: H{Xi = 0,^2 = 0) = 0.488 
and H{Xi = 0,^2 = 1) = 0.523, for a total entropy 
of H{Xi = 0,^2) = 1.011. The increment in entropy 
is obtained as the difference between the current MES 
(considering the addition of X2) and the previous scheme 
(without it) -see Table |II] 





H{XuX2) 


-P(Xi) logP(Xi) 


difference 


A^iXi = 0) 


l.OIl 


0.481 


0.531 


A,(Xi = 1) 


0.939 


0.515 


0.424 






A^ 


0.954 



Algorithm 7: Incremental Multivariate Joint Entropy 

input : r: CuiTent subset; 

X^: feature to be added; 

Ht ■ Current subset joint entropy; 

Et '■ Marginal entropies scheme of Ht ', 

D : Data set; 
output: r, Ht, Et 
1 if |r = then 

T ^ {X^} 

D ^ Sort(D) 
Ht <— Joint Entropy of D 
Et <— M arginalEntropyScheme{D\T) 



12 
13 
14 
15 



else 



r+ ^ r U {X,} 
Sort{D\T+) 

E^+ -i— M arginalEntropyScheme{D\T~^) 

II j runs through the values of r 



Ht Ht + Ar 

Et E_+ II new MES 



Table II 

At computations FROM THE Marginal Entropy Scheme -SEE Table[I] 



4-5), which will be taken as input to the next computation. 
Note that these two lines can be efficiently implemented as 
one function and using only one loop-cycle, with complexity 
9{\D\), where \D\ is the number of training instances. 

In the else part of the if clause, the MES is calculated 
with the addition of Xi to the current subset t (named 
Et+)- Taking into account that previous MES inherits the 
ordering sequence derived from a previous stage (because 
of lines 5 and 9), entropies generated by changes in the 
MES given by t U {Xi} are summed (E^-) in groups (line 
11) by the newly formed patterns, rendering a one-to-one 
correspondence between previous MES and current MES. 



Finally, this last value is applied to eq. (|9|l to obtain the 
joint entropy H{Xi,X2) = H{Xi) + A^(X2) = 0.996 + 
0.954 = 1.950. The Ustings in Algorithms |f] and M show 
the pseudo-code to compute the procedure explained above. 
The notation D \ r stands for the restriction of the dataset D 
to the features in r. 

Initial entropy is evaluated in lines 2-5. This first step 
calculates starting joint entropy as well as its first MES (lines 



Algorithm 8: MarginalEntropyScheme Function 

input : D : Data set; 
output: E 

1 foreacli unique value v in D do 

2 1^ T [d] <— fraction of instances in D with value v 

3 E <— H{T) llcalculate entropy of this distribution 



Thus, the entropy contribution /\T{Xi), showing the ef- 
fect of adding Xi to r, is computed by the difference in both 



MESs (line 11), being finally added to the current entropy 
H-r (line 13). The implementation of lines 10-11 follows the 
same consideration as lines 4-5, and hence complexity is in 
the same order. 

The incremental multivariate joint entropy is used to ob- 
tain an index of relevance (acting as the objective function) 
of a feature Xi ^ X Xo a. class Z with respect to a subset 
T ^ X \ Xi and is defined by: 



J{X,-Z\t) 



H{Z) + H{T,Xi)^H{T, Z,X,) 
H{Z) 



(10) 



Note the denominator acts as a normalization factor, such 
that J e [0,1], with J = 1 corresponding to the highest 
relevance. The reward of using this objective function by a 
TAFS-like algorithm consists in the possibility of testing it 
in highly complex domains such as microarray data sets. We 
name the combination of eTAFS and the objective function 
in eq. (fTOl l as the /i-TAFS algorithm. 

V. Experimental Work 

To compute the necessary entropies described in previous 
section, a discretization process is needed. This change of 
representation does not often result in a significant loss of 
accuracy (sometimes significantly improves it \16\, |T7l); it 
also offers reductions in learning time [ 18] . In this work, the 
CAIM algorithm was selected for two reasons: it is designed 
to work with supervised data, and does not require the user 
to define a specific number of intervals lfT9l . 

A. Data sets 

Five public-domain microarray gene expression data sets 
are used to test and validate the approach proposed in 
this work: Colon Tumor: 62 observations of colon tissue, 
of which 40 are tumorous and 22 normal, 2,000 genes 
pOl. Leukemia: 72 bone marrow observations and 7,129 
probes: 6,817 human genes and 312 control genes 13|. The 
goal is to tell acute myeloid leukemia (AML) from acute 
lymphoblastic leukemia (ALL). Lung Cancer: distinction be- 
tween malignant pleural mesothelioma and adenocarcinoma 
of lung 11271 : 181 observations with 12,533 genes. Prostate 
Cancer: used in (2T\ to analyze differences in pathological 
features of prostate cancer and to identify genes that might 
anticipate its clinical behavior; 136 observations and 12,600 
genes. Breast Cancer: 97 patients with primary invasive 
breast carcinoma; 12,600 genes analyzed [,23] . 

B. Settings 

Provided that the core nature of the /i-TAFS algorithm, 
and even many other algorithms -e.g. Genetic Algorithms, 
Neural Networks-, resides in their stochasticity, several runs 
have to be performed, in order to better asses the average 
behaviour of the methods. 

The experimental design to test /i-TAFS algorithm mea- 
sures performance by carrying out m = 100 different 



independent runs. In each run, /i-TAFS is executed on 
the corresponding dataset and returns the set of all those 
feature subsets reaching the best found performance func- 
tion (maximum relevance, in this case). To overcome the 
existence of many solutions, the subset that offers the lowest 
mutual information (MI) among its elements -i.e. the less 
redundancy- is taken as the subset delivered in this run. 
After completing the m execution runs, the obtained subsets 
can be ordered from minimum to maximum MI value. 

The /i-TAFS parameters are as follows: e — 0.01, Tq = 
0.1 and T,nin — 0.0001. These settings were chosen after 
preliminary fine-tuning and are kept constant for all the 
problems ifTTl . The cooUng function was chosen to be 
geometric a{t) = 0.9 1, following recommendations in the 
literature llT2l . 



Data set 


Time 


Jeval 


size 


Colon Tumor 


6.41 


503,901 


6.93 


± 


0.06 


Leukemia 


6.51 


506,489 


3.36 


± 


0.06 


Lung Cancer 


7.45 


560,972 


2.58 


± 


0.04 


Prostate Cancer 


98.74 


7,119,800 


9.85 


± 


0.05 


Breast Cancer 


136.93 


10,943,628 


9.62 


± 


0.03 



Table III 

/i-TAFS RUNNING PERFORMANCE. Time INDICATES THE AVERAGE 
RUNNING TIME (IN MINUTES) OVER THE 100 EXECUTIONS; 7eva/ IS THE 
AVERAGE NUMBER OF EVALUATIONS OF J; size THE AVERAGE SIZE OF 
THE FINAL SOLUTIONS AND ITS STANDARD ERROR. 



VI. Experimental results 
A. fi-TAFS performance results 

The evolution of /i-TAFS from a high temperature state to 
a frozen point is depicted in Fig.[T] Highly unstable -i.e. high 
temperature condition- readings are observed at the initial 
stages in each of the datasets. As soon as the algorithm 
becomes more relaxed due to eq. ([T]i, worse solutions are 
avoided. The frozen condition is observed at the final stages 
of each execution, where J values consecutively reach the 
maximum possible value (J — 1) in all cases. 

The running performance of /i-TAFS is summarized in 
Table III. The results show that /i-TAFS yields subsets of 
considerably low size and also low variability. Notorious 
readings correspond to Leukemia and Lung Cancer. It is 
conjectured that such sizes respond to the nature of the pro- 
posed information theoretic model on discretized data sets, 
in the sense that only a few genes significantly contribute 
to increase the index of relevance given by eq. (fTOl . On the 
one hand, working with continuous features, the index would 
tend to vary smoothly -i.e. generating small increments; as a 
consequence, more features are added-deleted. On the other 
hand, discrete features variations are normalized by their 
discretization scheme, so small increments in the real-value 
are merged into a single discrete value. Therefore, mostly 
significant increments are truly reflected in its addition- 
deletion from the current subset. 
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Figure 1. /j-TAFS search processes. The x-axis is the iteration counter 
for the outer loop of the algorithm. 



Computational demands when processing the smaller data 
sets (as for Colon tumor, with 2,000 features) are kept 
by /i-TAFS considerable low (5 to 10 minutes). The two 
more complex problems. Prostate and Breast Cancer re- 
quire approximately 1.5 and 2 hours of total processing 
time. Unfortunately, there is scarcely any reporting on time 
consumption in recent scientific literature that would enable 
us to establish a reasonable comparison. 

B. ji-TAFS accuracy results 

Eight classifiers were evaluated by means of 10 times 
10-fold Cross Validation (10x10 CV), a resampling method 
designed to handle small-sized data sets. The chosen clas- 
sifiers are: the nearest-neighbor technique with Euclidean 
metric (kNN) and parameter k (number of neighbors running 
from 1 to 15), the Naive Bayes classifier (NB), a Lin- 
ear and Quadratic Discriminant classifier (LDC), Logistic 
Regression (LR), the Support Vector Machine with linear 
and quadratic kernel (ISVM and rSVM) and parameter C- 
regularization constant (with C = 2^, k running from —7 
to 7) and the Support Vector Machine with radial basis 
function kernel (rSVM) and parameter C and ^-smoothing 
in the kernel function (with 7 = 2*^, k running from —7 to 



70 The non parametric Wilcoxon signed-rank tesl0 is used 
for the (null) hypothesis that the median of the differences 
between the errors of the winner classifiers per data set 
and another classifier's error is zero. The non-parametric 
Wilcoxon signed-rank test will be used for the (null) hy- 
pothesis that the median of differences between classifiers 
accuracies are zero, at the 95% level of significance. 



Data set 


Classifier 


10x10 CV 


size 


Colon Tumor 


ISVM (C = 2^) 


89.19±0.38 


5 


Leukemia 


ISVM (C = 2-'') 


99.62±0.27 


3 


Lung Cancer 


LR 


99.89±0.07 


4 


Prostate Cancer 


NN (6) 


95.66±0.21 


7 


Breast Cancer 


rSVM (C = 2^,1 = 2"^) 


86.90±0.48 


6 



Table IV 

/iTAFS: 10X10 MEAN CROSS-VALIDATION ACCURACY (70x70 CV) 
COMPLEMENTED WITH ITS STANDARD ERROR FOR THE BEST MODEL IN 
EACH DATA SET. THE Classifier COLUMN INDICATES THE BEST METHOD 
ALONG WITH BEST PARAMETERS. 



The obtained solutions are displayed in Table |IV] Among 
the eight classifiers used to test the solutions, only the final 
model is presented. Lung Cancer, Leukemia and Prostate 
Cancer reach remarkably high accuracies, while Colon 
Tumor and specially Breast Cancer show lower 10x10 
CV readings. In all cases, the subset that delivers this 
performance is considerable small, having 7 genes or less 
(and only 3 genes in the Leukemia data set). Moreover, 
all Wilcoxon test p-values signal significant differences 
(p < 0.05) between the best method and all other methods 
in the corresponding data set, except for the ISVM vs. LR 
in Colon Tumor (p = 0.312). 

C. Discussion of the results 

It is a common practice to compare to similar works in 
the literature. Unfortunately, the methodological steps are 
in general very different, especially concerning resampling 
techniques, making an accurate comparison a delicate under- 
taking. Nonetheless, such a comparison is presented in Table 
[V] Seven references which are illustrative of recent work are 
indicated, including previous work from the authors. In this 
table the validation method, the best classifier and the best 
reported result are detailed (final accuracy and number of 
genes involved). 

The Colon Tumor data set presents difficulties in clas- 
sification, never reaching 90%. The solution delivered by 
/7-TAFS is comparable with the best known (that of BGS^ 
II25I ): however, it uses 5 genes against the 9 used by BGS^. 

'For the experiments, we use a MATLAB implementation; specifically, 
for the SVMs we use the MATLAB interface to LIBSVM 124]. All tests 
are run on on a regular x86 workstation. 

*The Wilcoxon signed-rank test is a non-parametric statistical hypothesis 
test for the analysis of two related samples, or repeated measurements on 
a single sample. It can be used as an alternative to the paired Student's 
t-test when the population cannot be assumed to be normally distributed. 
It should therefore be used whenever the distributional assumptions that 
underlie the t-test cannot be satisfied. 







Colon 




Lung 


Breast 


Prostate 


Work 


Validation 


Tumor 


Leukemia 


Cancer 


Cancer 


Cancer 


125J(F) 


lOxlOCV 


89.36 


97.89 


98.84 


83.37 


93.43 






(9,3NN) 


(2,NB) 


(4,LR) 


(12,1SVM) 


(3,10NN) 


(261(F) 


200-B.632 


88.75 


98.2 


- 


— 


— 






(14,1SVM) 


(23,1SVM) 


- 


- 


- 


I23(W) 


lOxlOCV 


85.48 


93.40 


— 


— 


— 






(3,NB) 


(2,NB) 


- 


- 


- 


IzaKW) 


100-RS 


87.31 




72.20 










(94,SVM) 




(23,SVM) 






(HKW) 


50-HO 


77.00 


96.00 


99.00 


79.00 


93.00 






(33,rSVM) 


(30,rSVM) 


(38,rSVM) 


(46,rSVM) 


(47,rSVM) 


(30) (FW) 


lOxlOCV 






99.40 




96.30 










(135,5NN) 




(79,5NN) 




lOCV 




98.6 


99.45 


68.04 


91.18 








(2,SVM) 


(5,SVM ) 


{8,SVM) 


(6,SVM) 



Table V 

Best results reported in the literature for the explored problems : (F) indicates that the referenced work uses a Filter-Based 
Algorithm, (W) for wrapper and (F-W) for a combination of both schemes; in parentheses, the size of the subset (number of 
genes) and the inducer optimized. a - sign indicates that the problem was not studied by the reference. the validations are: 
10x10 cv (10 times 10-fold cross validation), 10 cv (10-fold cross validation), 100-rs (100 times random subsampling), 50-ho 

(50 TIMES Holdout) and 200-B.632 (0.632 bootstrap of size 200). 



The other difficuh problem seems to be Breast Cancer. 
In this data set, ^-TAFS gives the best resuh among the 
references consulted, using also less genes and in front of 
solutions that employ a pure wrapper strategy. For the other 
three problems, /i-TAFS is also able to yield better solutions 
compared to other approaches, many of them using a much 
bigger gene subset. 

Expression levels for each model in the five data sets are 
given in Fig. |2] It is seen that each model posses genes 
that are visually identified as the ones that present irregular 
expression levels: Colon Tumor genes M76378 and T51288; 
Leukemia genes AFFX-CreX-5_at and L09209; Lung Can- 
cer gene 37157_at; Prostate Cancer genes 38322_at and 
37639_at; and Breast Cancer genes Contigl4882_RC, Con- 
tig53822_RC and Contig57657_RC. 



Data set 



Gene ID 



Colon Tumor M76378, H08393, T51849, M19311, T51288 
Leukemia AFFX-CreX-5_at, L09209, X75755 

Lung Cancer 37157_at, 33221_at, 107_at, 40790_at 
Prostate Cancer 1230_g_at, 38322_at, 37639_at, 32909_at, 660_at 

35998_at, 34107_at 
Breast Cancer AB014543, Contigl4882_RC, Contig53822_RC 
Contig57657_RC, Contig53713_RC, NM_006191 

Table VI 

Genes identification for each final model. 



D. Biological evidence in the solution subsets 

The genes corresponding to the solutions displayed in 
Table IV are detailed in Table VI. In the following, known 
biological evidence is presented about the effect of gene ex- 
pressions in each cancer disease. This evidence is assembled 
by examining recent relevant medical literature. 



Colon Tumor: 

• M76378 CSRPl -Cysteine and glycine-rich protein L 
This gene encodes a member of the cysteine-rich pro- 
tein (CSRP) family. It may be involved in regulatory 
processes important for development and cellular dif- 
ferentiation. Hypomethylation, a decrease in the epige- 
netic methylation of cytosine and adenosine residues in 
DNA, of CSRIPl and other genes were confirmed in 
the cancer cells by bisulfite sequencing |32|. 

. H08393 COLllAl-collagen, type XL alpha 2 (Homo 
sapiens). Two alpha chains of type XI collagen, a minor 
fibrillar collagen are encoded by this gene [33]. Up- 
regulation of this gene in the mucosa stromal cells 
of both familial adenomatosis polyposis and sporadic 
colorectal cancer has been detected [34]. 

• T51849 EPHBl-Tyrosine-protein kinase receptor elk 
precursor. EphBl is a member of receptor tyrosine 
kinases of the EphB subfamily and has been positively 
identified in the development, progress and prognosis 
of colorectal cancers 135|. 

• M19131 CALM2-calmodulin 2 (phosphorylase kinase, 
delta). Caml2 plays an important role in intracellular 
calcium signaling, which regulates a variety of cellular 
processes, such as cell proliferation and gene transcrip- 
tion ||36| . Increased expression levels of this gene were 
found in anaplastic large cell lymphoma cell lines ifJTl . 

• T51288 ASSl-argininosuccinate synthase (human). 
Arginine, a semi-essential amino acid in humans, is 
critical for the growth of human cancers as in primary 
ovarian, stomach and colorectal cancer, whose expres- 
sion levels read high values B38I . 

Leukemia: 

. AFFX-CREX-5 AT NOT IDENTIFIED. 
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Figure 2. Expression levels of winner models as indicated in Table V. 
Samples for each data set are distributed as follows: Colon Tumor. Tumor 
1-41, Normal 42-62; Leukemia: Tumor 1-48, Normal 49-72; Lung Cancer: 
Tumor 1-31, Normal 32-181; Prostate Cancer: Tumor 1-78, Normal 79- 
136; and Breast Cancer: Tumor 1-46, Normal 47-97; 



« L09209 AP LP 2 -amyloid beta (A4) precursor-like pro- 
tein 2 (Homo sapiens). The function of this gene is 
not fully understood, but it conjectured that may play a 
role in the regulation of hemostasis ||39l . This gene was 
reported as over-expressed by other scientific literature 
as in |40| 

. X95735_at ZYX-ZYXIN. It is involved in the spatial 
control of actin assembly and in the communication 
between the adhesive membrane and the cell nucleus 
BTl . This is a gene found in many cancer classification 
studies f3l, f42l, [43], and is highly correlated with 
acute myelogenous leukemia. 

Lung Cancer: 

> 37957_at ATG4-Autophagy related 4 homolog A. Au- 
tophagy is the process by which endogenous proteins 
and damaged organelles are destroyed intracellularly. 
Autophagy is postulated to be essential for cell home- 
ostasis and cell remodeling during differentiation, meta- 
morphosis, non-apoptotic cell death, and aging 



It is activated during amino-acid deprivation and has 
been associated with neurodegenerative diseases, can- 
cer, pathogen infections and myopathies 1.44 1 . 

• 3322 l_at PAXIPl-PAX interacting (with transcription- 
activation domain) protein L Member of the paired box 
(PAX) gene family, this gene plays a critical role in 
maintaining genome stability by protecting cells from 
DNA damage |39|, |45|. Analysis of pulmonary ade- 
nocarcinomas in experiment GDS1650 in |,33J records 
shows over-expression levels of this gene. 

• 40790_at BHLHE40-basic helix-loop-helix family, 
member e40. This gene encodes a basic helix-loop- 
helix protein expressed in various tissues, which is may 
be involved in the control of cell differentiation [33j. 
Experiments suggest that loss of DECl expression is 
an early event in the development of lung cancer ll46l 

• 107_at RAB40A-member RAS oncogene family. This 
gene encodes a member of the Rab40 subfamily of Rab 
small GTP-binding proteins that contains a C-terminal 
suppressors of cytokine signaling box |39|. No medical 
evidence was found in hterature about its role in cancer. 

Prostate Cancer: 

• 1230_g_at MTMRll-myotubularin related protein 77. 
Experiments on patients with acute lymphoblastic 
leukemia and with Burkitt lymphoma, three putative 
oncogenes or tumor suppressor genes were found, one 
of them was the MTMRll t47ll . 

• 38322_at PAGE4-P antigen family, member 4 (prostate 
associated). This gene is strongly expressed in prostate 
and prostate cancer; and also expressed in other tissues 
as in testis, fallopian tube, uterus, placenta, as well as 
in testicular cancer and uterine cancer [39 1. 

• 37639_at HPN-Hepsin. Hepsin is a cell surface serine 
protease and plays an essential role in cell growth and 
maintenance of cell morphology and it is highly related 
with prostate cancer, benign prostatic hyperplasia ll39l . 

« 32909_at AQP5-aquaporin 5. Acting as a water chan- 
nel protein, Aquaporins are a family of small integral 
membrane proteins linked to other proteins, whose 
role is the generation of saliva, tears and pulmonary 
secretions 139|. Experiments with cases of normal and 
epithelial ovarian tumors tissues suggest an important 
role of this gene in the tumorigenesis of the latter, and 
a possible relationship with the ascites formation of 
ovarian carcinoma P8l . 

« 660_at CYP24A1 -cytochrome P450, family 24, subfam- 
ily A, polypeptide 7. This gene encodes a member 
of the cytochrome P450 superfamily of enzymes. The 
cytochrome P450 proteins catalyze many reactions in- 
volved in drug metabolism and synthesis of cholesterol, 
steroids and other lipids ll39| . This gene has been 
reported as responsible for degradation of the active 
vitamin D metaboUte 1, 25-dihydroxy vitamin D3 which 



is known to be antimitotic in prostate cancer cells ||49l . 
. 35998_at Hypothetical protein LOC284244 
(LOC284244). No evidence found. 

• 34107_at PFKFB2-6-phosphofructo-2-kinase/fructose- 
2,6-biphosphatase 2. The protein encoded by this 
gene is involved in the synthesis and degradation of 
fructose-2,6-bisphosphate, a regulatory molecule that 
controls glycolysis in eukaryotes fW). It has been 
suggested that the induction of de novo lipid synthesis 
-process that protects cancer cells from free radicals 
and chemotherapeutics- by androgen requires the up- 
regulation of HK2 and PFKFB2 fSOll . 

Breast Cancer: 

• AB014543 CLUAPl-clusterin associated protein 1 
(Homo sapiens). This gene is highly expressed in 
osteosarcoma, ovarian, colon, and lung cancers [51]. 

. Contig57657_RC PAKl-p21 protein (Cdc42/Rac)- 
activated kinase 1 (Homo sapiens). This gene encodes 
a family member of serine/threonine p21 -activating 
kinases, known as PAK proteins, whose role is the 
regulation of cell motility and morphology f33l. Pakl 
is directly related with the Etk/Bmx protein, acting this 
later as a control to the proliferation and tumorigenic 
growth of mammary epithelial cancer cells [52 1. 

. NM_006191 PA2G4-Prolife ration-associated 2G4, 
38kDa (PA2G4). Also known as EBPl, this gene 
encodes an RNA-binding protein that is involved in 
growth regulation ll39l . The EBPl has been shown to 
be a transcriptional corepressor that inhibits the growth 
of human breast cancer cell lines (53]. 

. Contigl4882_RC, Contig53822_RC, Con- 
tig53713_RC NOT IDENTIFIED. 

VII. Conclusions 

A new algorithm for feature selection using Simulated 
Annealing guided by the discrete multivariate joint entropy 
has been introduced and evaluated. Our experimental results 
concern the search for small subsets of highly relevant 
genes in five public domain Microarray Gene Expression 
data samples. The very promising results indicate that the 
algorithm offers a promising and general framework for 
feature selection in very high dimensional data sets. 

The entropic relevance measure has shown to be a good 
candidate as the objective function to be optimized by the 
algorithm. The reported classification results are competitive 
to current standards in analyzing microarray gene expression 
data with a very affordable execution time. This last aspect 
should not be overlooked, since database size is constantly 
growing and the complexity of optimization scenarios (that 
make extensive use of resampling methods) is ever greater 
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